Multi-level checkpointing and silent error detection for linear workflows

Authors

  • Anne Benoit
  • Aurélien Cavelan
  • Yves Robert
  • Hongyang Sun
Abstract

We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.

Key-words: resilience, fail-stop errors, silent errors, multi-level checkpoint, verification, dynamic programming.

Affiliations: École Normale Supérieure de Lyon; INRIA, France; University of Tennessee Knoxville, USA.

Résumé (translated from French): Fail-stop and silent errors can no longer be ignored on large-scale platforms. Efficient resilience techniques must accommodate both types of errors. A traditional checkpoint and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error causes the loss of the whole memory content, hence the obligation to save to reliable storage (typically a disk). To handle several types of fail-stop errors, we use a multi-level checkpointing approach on stable storage. In contrast, we use in-memory checkpoints for silent errors, which yields much lower overheads. In addition, recent detectors offer partial verification mechanisms, which are cheaper than guaranteed verifications but do not detect all silent errors. We show how to combine all these techniques for HPC applications whose dependency graph is a chain of tasks, and we give several dynamic programming algorithms that return the optimal solution in polynomial time. Simulations demonstrate that the combined use of multi-level checkpointing and verifications improves performance.
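
For orientation, the single-level baseline that the abstract compares against can be sketched as a small dynamic program over the task chain. This is not the multi-level or verification-aware algorithm of the paper, only a minimal illustration under stated assumptions: fail-stop errors arrive as a Poisson process with rate `lam`, every checkpoint costs `ckpt`, every recovery costs `recovery`, the first segment restarts from the initial state at no recovery cost, and the last task is always checkpointed. The function names and parameters are hypothetical. The expected time of one segment uses the classical closed form e^{λR} (1/λ + D) (e^{λ(W+C)} − 1).

```python
import math
from itertools import accumulate


def expected_segment_time(work, ckpt, recovery, downtime, lam):
    """Expected time to run `work` then a checkpoint, under exponential
    fail-stop errors of rate `lam` (assumed model, see lead-in)."""
    return (math.exp(lam * recovery)
            * (1.0 / lam + downtime)
            * math.expm1(lam * (work + ckpt)))


def optimal_checkpoints(weights, ckpt, recovery, downtime, lam):
    """Return (expected makespan, 1-based indices of tasks followed by a checkpoint)."""
    n = len(weights)
    prefix = [0.0] + list(accumulate(weights))  # prefix[j] = total work of tasks 1..j
    best = [0.0] + [math.inf] * n               # best[j]: optimal expected time up to task j
    prev = [0] * (n + 1)                        # prev[j]: previous checkpointed task (0 = start)
    for j in range(1, n + 1):
        for i in range(j):
            # Assumption: restarting from the very beginning costs no recovery.
            rec = recovery if i > 0 else 0.0
            cost = best[i] + expected_segment_time(
                prefix[j] - prefix[i], ckpt, rec, downtime, lam)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk back through prev[] to recover the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    return best[n], positions[::-1]


if __name__ == "__main__":
    makespan, positions = optimal_checkpoints([100.0] * 4, 1.0, 1.0, 0.0, 0.01)
    print(makespan, positions)
```

With a very small error rate the sketch places a single checkpoint at the end of the chain, while a higher rate pushes it to checkpoint after every task. The double loop makes the running time quadratic in the number of tasks.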

Similar resources

Coping with silent errors in HPC applications

This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat for scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. Then we introduce a general-purpose technique based upon computational patterns that periodically repeat over time. ...

Two-level checkpointing and partial verifications for linear task graphs

Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an extern...

A backward/forward recovery approach for the preconditioned conjugate gradient method

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP’13, pp. 167–176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. W...

Efficient checkpoint/verification patterns for silent error detection

Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application ...

Silent error detection in numerical time-stepping schemes

Errors due to hardware or low-level software problems, if detected, can be fixed by various schemes, such as recomputation from a checkpoint. Silent errors are errors in application state that have escaped low-level error detection. At extreme scale, where machines can perform astronomically many operations per second, silent errors threaten the validity of computed results. We propose a new pa...

Publication date: 2017